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A domain  decomposition  parallelization  of  the 
Fast  Marching  Method 

By  M.  Herrmann 


1.  Motivation  and  objectives 

Evolving  interfaces  play  an  important  role  in  a multitude  of  different  areas,  ranging 
from  fluid  mechanics,  combustion,  and  grid  generation  to  material  sciences,  semiconduc- 
tor manufacturing,  seismic  analysis,  and  control  problems  [see  Sethian  (19996)  for  a de- 
tailed overview].  Traditionally,  interfaces  have  been  treated  in  a Lagrangian  framework 
tracking  their  evolution  by,  for  example,  marker  particles  [see  among  others  Brackbill 
et  al.  (1988)].  In  recent  years,  however,  describing  the  topology  and  evolution  of  inter- 
faces by  Eulerian  partial  differential  equations  (PDE)  has  become  ever  more  popular 
since  this  approach  offers  certain  theoretical  and  computational  advantages  over  the  La- 
grangian formulation  (Sethian  19996).  Depending  on  the  type  of  problem,  two  different 
solution  strategies  for  the  Eulerian  approach  exist.  In  the  case  of  an  initial  value  prob- 
lem, level  set  methods  (Osher  & Sethian  1988),  or  alternatively  Volume-of-Fluid  methods 
(Noh  & Woodward  1976),  can  be  employed  to  solve  the  evolving  interface.  In  the  case  of 
a boundary  value  problem,  the  Fast  Marching  Method  (Sethian  1996a)  has  emerged  as 
the  efficient  solution  method. 

In  this  paper,  we  focus  on  two  boundary  value  problems  that  typically  arise  in  the 
numerical  implementation  of  level  set  methods,  namely  reinitialization  and  redistribution. 
In  level  set  methods,  an  iso-surface  of  the  level  set  scalar  G, 

G(x,t)  = Go  = const , (1.1) 

is  used  to  define  the  location  of  an  arbitrary  shaped  interface  T.  The  transport  equation 
for  the  scalar  G can  then  be  derived  from  simple  kinematic  considerations  as 

— + u-VG  = 0.  (1.2) 

at 

Since  this  so-called  level  set  equation  (1.2)  is  valid  only  at  the  location  of  the  interface 
itself,  the  choice  of  G outside  the  interface,  i.e.  G / Go,  is  in  principle  arbitrary.  However, 
for  numerical  reasons,  G is  generally  chosen  to  be  a distance  function, 

Enforcing  this  condition  is  usually  called  reinitialization. 

Furthermore,  some  quantities  S may  be  defined  only  at  the  location  of  the  interface. 
In  order  to  extend  these  quantities  to  the  whole  computational  domain,  they  are  set 
constant  in  the  interface  normal  direction, 

VS  • VG  = 0.  (1.4) 

This  procedure  is  called  redistribution,  since  it  redistributes  the  values  of  S from  the 
interface  into  the  surrounding  domain. 
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Both,  Eqs.  (1.3)  and  (1.4)  constitute  boundary  value  problems,  since  G and  5 are 
defined  on  F only.  Several  different  numerical  methods  exist  to  solve  these  equations. 
Among  these  are  brute  force  approaches  based  on  calculating  the  minimum  of  the  ge- 
ometric distance  between  each  grid  node  in  the  computational  domain  and  every  point 
on  the  interface  (Merriman  et  al.  1994),  and  PDE-based  approaches  like  the  iteration 
scheme  by  Sussman  et  al.  (1994).  While  the  former  are  computationally  very  expensive, 
the  latter  inadvertently  alter  both  the  location  of  interface  and  the  value  of  S on  the 
interface  due  to  their  iterative  nature.  In  the  case  of  the  reinitialization  equation  (1-3), 
supplementary  fixes  addressing  this  problem  relatively  successfully  have  been  proposed 
(Russo  & Smereka  2000;  Peng  et  al.  1999;  Sussman  et  al.  1998;  Sussman  & Fatemi  1999; 
Enright  et  al.  2002). 

An  alternative  approach  to  solving  Eqs.  (1.3)  and  (1.4)  is  the  so-called  Fast  Marching 
Method  (FMM).  It  was  originally  proposed  by  Tsitsiklis  (1994,  1995)  and  applied  to 
the  level  set  formulation  by  Sethian  (1996a)  and  Helmsen  et  al.  (1996).  The  FMM  is  a 
non-iterative  procedure  that  explicitly  makes  use  of  the  way  information  in  Eqs.  (1.3) 
and  (1.4)  propagates.  The  FMM  is  thus  theoretically  optimal  in  its  operation  count. 
Still,  typical  problem  sizes  of  state  of  the  art  numerical  simulations  in  general  require 
a domain  decomposition  approach  for  parallel  computing.  However,  the  FMM  is  highly 
sequential  and  hence  not  straightforward  to  parallelize  in  a domain  decomposition  sense. 
At  least  to  the  knowledge  of  the  author,  no  domain  decomposition  parallelization  of  the 
Fast  Marching  Method  has  been  published  yet. 

This  paper  is  structured  as  follows:  first,  the  standard,  sequential  FMM  is  reviewed. 
Then,  different  parallelization  strategies  are  discussed  and  a domain  decomposition  par- 
allelization is  proposed.  Thereafter,  some  preliminary  results  concerning  the  speedup  of 
the  proposed  method  are  presented  and  discussed.  Finally,  conclusions  are  drawn  and  an 
outlook  to  future  work  is  given. 


2.  The  sequential  Fast  Marching  Method 

In  this  section,  a short  overview  of  the  standard  Fast  Marching  Method  is  given.  For 
further  details,  the  interested  reader  is  referred  to  Sethian  (1996a),  Adalsteinsson  & 
Sethian  (1999),  and  Sethian  (1999a). 

The  idea  of  the  FMM  for  level  sets  is  to  first  solve  Eq.  (1.3)  and  use  its  solution  to 
then  solve  Eq.  (1.4)  (Adalsteinsson  & Sethian  1999).  Hence,  we  will  at  first  focus  on  the 
solution  of  Eq.  (1.3). 

To  solve  Eq.  (1.3)  correctly,  the  gradient  operator  has  to  be  approximated  by  upwind, 
entropy-satisfying  finite  differences  (Sethian  1999  a).  The  approximation  most  often  used 
is  due  to  Godunov  (Rouy  & Tourin  1992): 

|VG|«  [max(D£G,-D+lG,  0)2+ 

max(D-|G,-^G,0)2+  (2.1) 

max(D-G,-T»+^,0)2]1/2  , 

where  Df^k  are  difference  notations.  For  example,  the  first  order  approximation  is 

n-a/'-f  _ Gijk  ~ Gi-ljk  n+X/'Ti  Gi+ijk  — G^jk 

W ~ Ax  ’ D^G  ~ Ax 


(2.2) 
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(a)  Calculate  all  node  values  that  are  directly  adjacent  to  the  interface  and  tag  them  as 
accepted.  Tag  all  nodes  adjacent  to  these  accepted  nodes  as  close  and  all  others  as  far. 

(b)  Calculate  G of  all  close  nodes  by  Eq.  (2.3),  treating  G in  any  adjacent  close  or  far  node 
as  oo.  Set  the  loop  index  n = 1. 

(c)  Mark  as  accepted  the  close  node  ijk  with  the  smallest  G value,  denoted  by  Gn  — Gijk- 

(d)  Mark  all  far  nodes  adjacent  to  Gy*  as  close. 

(e)  Recalculate  the  G values  of  all  close  nodes  adjacent  to  Gijk  by  Eq.  (2.3),  treating  G in 
any  adjacent  close  or  far  node  as  oo. 

(f)  Set  n = n + 1 and  return  to  step  (c)  until  all  nodes  are  accepted. 

Table  1.  The  sequential  FMM  algorithm 


Thus,  the  discretized  version  of  Eq.  (1.3)  solved  in  the  FMM  reads  as 

[max(DJkG,-D+xkG,  0)2+ 

max(Z)-|C?,-I>^G,0)2+  (2.3) 

n 1/2 

max(Z)-G,-Z)^G,0)2]  =1. 

Provided  that  the  G values  of  all  nodes  neighboring  ijk  are  given,  Eq.  (2.3)  constitutes  a 
quadratic  equation  yielding  Gijk  itself.  The  simple,  albeit  inefficient  way  to  solve  Eq.  (2.3) 
throughout  the  whole  computational  domain  is  to  iteratively  update  each  node  in  the 
domain  by  Eq.  (2.3)  until  a stationary  solution  is  reached  (Rouy  & Tourin  1992). 

However,  this  approach  neglects  to  take  advantage  of  the  fact  that,  due  to  the  upwind 
structure  of  Eq.  (2.3),  information  propagates  only  from  smaller  to  larger  values  of  G. 
This  yields  attribute  1 of  the  FMM: 

ATTRIBUTE  1.  A node  value  Gjjk  is  determined  only  by  those  neighboring  nodes  of 
smaller  value.  It  can  thus  globally  depend  at  most  on  those  nodes  in  the  domain  that  are 
of  smaller  value. 

Attribute  1 implies  that  a smallest  node  is  fixed  and  cannot  change  its  value.  Hence, 
given  appropriate  boundary  conditions  for  Gijk  at  or  adjacent  to  the  interface  G — Go, 
updates  of  G^k  according  to  Eq.  (2.3)  can  be  confined  to  a narrow  band  around  the 
globally  smallest  values  that  sweeps  outward  to  ever  larger  values  of  Gijk-  For  details  see 
Sethian  (1996a),  Sethian  (1999a),  and  Adalsteinsson  & Sethian  (1999). 

In  summary,  this  leads  to  the  sequential  FMM  algorithm  given  in  Table  1.  The  algo- 
rithm is  executed  once  for  all  nodes  G?.fc  < Go  and  once  for  all  nodes  G9.fc  > Go,  where 
the  superscript  0 denotes  the  initial  values  of  G at  node  ijk.  Furthermore,  the  following 
attribute  of  the  Fast  Marching  Method  can  be  discerned: 

ATTRIBUTE  2.  The  sequential  loop  steps  (c)-(f)  sort  all  accepted  nodes  that  are  not 
initially  accepted  in  step  (a)  in  ascending  order,  i.e.  Gn+1  > Gn. 

Using  a heap  sort  algorithm  with  back  pointers  (Sethian  19966)  to  locate  the  smallest 
value  G in  step  (c)  makes  the  sequential  FMM  algorithm  highly  efficient  with  a theoretical 
operation  count  of  O(NlogN). 
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(Am)  Perform  steps  (a)  and  (b)  of  the  sequential  FMM  algorithm. 

(Bm)  Send  all  nodes  with  G°jk  < Go  to  process  #0,  all  nodes  with  G°jt.  > Go  to  process  #1. 
(Cm)  Perform  sequential  FMM  algorithm  steps  (c)-(f). 

(Em)  Receive  results  for  nodes  G°jk  < Go  from  process  #0  and  results  for  nodes  G°ijk  > Go 
from  process  #1. 

Table  2.  The  parallel  FMM  algorithm  V\ 


3.  Parallelizing  the  Fast  Marching  Method 

In  this  section,  different  strategies  to  parallelize  the  FMM  are  discussed.  First,  a simple 
parallelization  strategy  not  based  on  domain  decomposition  is  given,  followed  by  the 
discussion  of  several  domain  decomposition  parallelization  approaches. 

3.1.  Go-based  parallelization 

The  trivial  way  to  parallelize  the  FMM  algorithm  described  in  table  1 is  to  execute 
the  complete  sequential  algorithm  for  all  nodes  G9fc  < G0  and  for  all  nodes  G°jfc  > 
Go  in  parallel,  since  neither  region  influences  the  other.  The  resulting  algorithm  V\  is 
summarized  in  Table  2. 

This  parallelization  strategy  has  two  obvious  drawbacks.  First,  the  parallelization  is 
limited  to  two  parallel  processes,  since  only  two  independent  regions  exist.  Secondly, 
depending  on  the  problem  size,  memory  resource  problems  arise,  because  both  processes 
need  to  work  on  the  whole  computational  domain. 

3.2.  Domain  decomposition  parallelization 

Domain  decomposition  parallelization  of  the  sequential  FMM  algorithm  poses  two  prob- 
lems. First,  the  globally  smallest  close  value  has  to  be  located  in  step  (c).  Although  this 
procedure  is  non-local  by  definition,  it  still  is  easy  to  parallelize,  since  each  domain  can 
compute  its  locally  smallest  value  independently,  and  then  simply  use  these  to  find  a 
global  minimum.  Second,  a new  globally  smallest  value  Gn  can  be  found  in  step  (c)  only 
after  steps  (d)-(e)  of  the  previous  loop  step  n — 1 have  been  executed. 

Nevertheless,  leaving  at  first  efficiency  considerations  aside,  domain  decomposition  is 
straightforward.  In  order  to  simplify  the  notation  in  the  following,  we  only  discuss  the 
domain  decomposition  into  two  neighboring  domains  Vm  and  2?m+1.  All  arguments,  how- 
ever, apply  analogous  to  the  three-dimensional  domain  decomposition  into  an  arbitrary 
number  of  domains.  Figure  1 depicts  some  naming  conventions  that  are  employed  in 
the  following.  Each  domain  Vm  is  extended  by  r ghost  nodes  in  the  domain  boundary 
normal  direction,  where  r is  the  spatial  order  of  the  difference  approximation,  Eq.  (2.2). 
Here,  only  the  first  order  approximation  is  employed.  Hence,  each  domain  is  extended  as 
depicted  in  Fig.  1. 

Assuming  that  each  process  m = 0...Np  works  only  on  part  Vrn  of  the  whole  compu- 
tational domain,  Table  3 summarizes  the  parallel  FMM  algorithm  Vi-  Note  that  V2  has 
to  be  executed  twice  on  each  process,  once  for  all  nodes  G?-fc  < Go  and  once  for  all  nodes 
G°.fc  > G0.  In  essence,  disregarding  step  (A),  algorithm  V2  is  a domain  decomposed 
sequential  algorithm,  because  globally  only  one  single  node  in  a single  domain  is  updated 
per  loop  step,  while  all  other  domains  are  waiting  idle.  However,  algorithm  V2  introduces 
a clearly  defined  inter-domain  communication  boundary.  Recalling  that  the  update  of 
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Gm  $m+r' 

r I | 

Domain  T> m T T Domain  X>m+l 


Bm  Gm+l 

Figure  1.  Domain  naming  conventions. 


(A)  Perform  steps  (a)  and  (b)  of  the  sequential  FMM  algorithm. 

(B)  Locate  the  locally  smallest  close  G value;  denote  it  as  Gvm  ■ 

(C)  Find  the  globally  smallest  close  G value,  Gg  — min(G®0  JVp),  and  mark  it  as  accepted. 
If  Gg  6 Dm,  go  to  next  step,  else  go  to  step  (F). 

(D)  Perform  sequential  FMM  algorithm  steps  (c)-(e)  for  Gg- 

(E)  If  any  of  the  close  nodes,  updated  in  step  (e)  of  the  sequential  FMM,  belong  to  Bm, 
communicate  them  to  Gm+i  of  domain  Vm+i  ■ 

(F)  Set  n = n + 1.  Return  to  step  (B)  until  all  local  nodes  are  accepted. 

Table  3.  The  parallel  FMM  algorithm  V2 


Gijk,  Eq.  (2.3),  using  first  order  approximations,  Eq.  (2.2),  does  involve  at  most  the 
directly  adjacent  nodes  in  the  ±i,  ±j,  and  ±k  direction,  any  local  node  in  domain  T>m 
can  be  updated  correctly,  provided  that  all  ghost  nodes  Qm  — Bm+i  are  known.  Hence, 
only  changes  of  Bm+\  need  to  be  communicated  to  Gm,  as  is  done  in  step  (E)  above. 

Taking  these  arguments  into  account,  one  can  in  fact  avoid  the  strict  sequential  nature 
of  algorithm  TV  Looking  for  example  at  domain  Vm,  as  long  as  no  change  in  Gm  occurs, 
the  locally  smallest  close  value  Gpm  can  be  moved  into  a local  accepted  group  and 
updates  of  the  nodes  adjacent  to  Gpm  can  be  performed  according  to  the  sequential 
FMM  algorithm  steps  (d)  and  (e). 

However,  special  care  must  be  taken  whenever  a node  belonging  to  Qm,  for  example 
G^k,  changes.  If  G^k  changes  to  accepted  status,  then,  recalling  attribute  1,  all  locally 
accepted  nodes  belonging  to  Vm  that  are  smaller  than  or  equal  to  G^k  cannot  be  influ- 
enced by  this  change.  Conversely,  all  locally  accepted  nodes  belonging  to  Vm  larger  than 
G^k  might  be  wrong,  since  they  could  depend  on  Gy*,.  Hence,  to  allow  for  a consistent 
algorithm,  each  domain  has  to  be  able  to  rollback  to  its  state  at  the  beginning  of  loop 
step  p,  where  p is  such  that  Gp_1  < G^k  < Gp.  The  same  argument  applies,  if  Gijk 
changes  from  accepted  to  close  status  through  a rollback  operation  in  domain  Vm+i. 
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(A)  Perform  steps  (a)  and  (b)  of  the  sequential  PMM  algorithm. 

(B)  Check,  if  any  node  belonging  to  Qm  changed  status  with  new  value  Gb-  If  so,  rollback  to 
state  Sp , where  p is  given  by  Gp~1  <Gb<Gp.  Mark  Gb  as  close  and  insert  it  into  list 
£b-  Set  n = p. 

(C)  Locate  the  locally  smallest  close  G value  (including  list  Cb);  denote  it  as  Gj>m . 

(D)  Perform  steps  (c)-(e)  of  the  sequential  FMM  algorithm  for  Gvm  ■ 

(E)  If  node  Gvm  or  any  of  its  adjacent  nodes  updated  in  step  (D)  belongs  to  Brn , communicate 
them  to  Qm+ 1 of  domain  Vm+ 1. 

(F)  Store  the  current  state  as  Sn. 

(G)  Set  n=n+l.  Return  to  step  (B)  until  all  local  nodes  are  accepted. 

(H)  Wait  until  all  other  domains  reach  step  (H)  or  a node  belonging  to  Gm  changes  status. 
In  the  latter  case,  go  to  step  (B). 

Table  4.  The  parallel  FMM  algorithm  Va 


These  considerations  lead  to  the  domain  decomposed  parallel  algorithm  Vz  summarized 
in  Table  4. 

The  drawback  of  algorithm  Vz  is  two-fold.  First,  the  complete  state  of  the  local  domain 
has  to  be  stored  at  every  loop  step  in  order  to  allow  for  a possible  rollback  to  this  state. 
Second,  any  change  in  status  of  a node  belonging  to  Gm  leads  to  a rollback  operation  in 
domain  Vm.  To  overcome  these  shortcomings,  we  will  make  use  of  the  following  additional 
attribute  of  the  FMM: 

ATTRIBUTE  3.  Let  Gp-k  be  the  solution  to  Eq.  (2.3)  at  loop  step  p.  If  any  single  one 
of  the  adjacent  nodes  i'j'k'  becomes  smaller,  i.e.  GJ(,.,k,  < Gp,.,k,  with  p'  > p,  then  a 
subsequent  update  of  node  ijk  by  Eq.  (2.3)  yields  Gp-k  < G?fc . 

The  proof  of  attribute  3 is  straightforward.  Attribute  3 implies  that  any  node  which 
has  been  locally  accepted  at  loop  step  p and  is  then  rolled  back  to  status  close,  can  retain 
its  G?fc  value,  because  any  subsequent  update  will  either  decrease  its  value  or  leave  it 
unchanged.  Its  change  back  to  accepted  status  at  a later  loop  step  is  thereby  uninfluenced. 
Following  the  same  line  of  argument,  all  neighboring  close  nodes  can  also  retain  their 
G i'j'k1  values.  Thus,  step  (B)  of  Vz  needs  to  rollback  only  the  status  of  the  nodes  from 
accepted  back  to  close,  but  does  not  need  to  rollback  their  node  values.  Furthermore, 
attribute  3 implies  that  only  the  change  to  accepted  status  of  a node  in  Gm  needs  to 
initiate  a rollback. 

Incorporating  attribute  3,  the  final  domain  decomposed  parallel  algorithm  Vi  is  given 
in  Table  5.  Obviously,  the  efficiency  of  algorithm  Vi  depends  on  the  required  amount  of 
inter-domain  communication,  i.e.  how  often  nodes  belonging  to  Bm  change  to  accepted 
status,  and  how  many  rollback  operations  are  required  in  step  (B).  If  the  solution  of 
Eqs.  (1.3)  and  (1.4)  is  required  only  up  to  a certain  distance  T away  from  the  G = Go 
interface,  then,  naturally,  only  those  nodes  within  this  band,  |G?.fc  — Go|  < T,  need  to  be 
considered  in  the  FMM  algorithm.  Depending  on  the  geometry  of  the  G = Go  interface, 
this  so-called  narrow  band  approach  (Peng  et  al.  1999)  may  drastically  reduce  the  nec- 
essary inter-domain  communication,  as  illustrated  in  Fig.  2.  In  fact,  as  long  as  there  are 
no  nodes  ijk  belonging  to  Bm  with  |G9.fc  — G0|  < T,  no  inter-domain  communication  is 
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(A)  Perform  steps  (a)  and  (b)  of  the  sequential  FMM  algorithm. 

(B)  Check,  if  any  node  belonging  to  Qm  changed  status  to  accepted  and  decreased  its  value 
from  a previously  accepted  value  to  the  new  value  Gb.  If  so,  rollback  by  marking  all  nodes 
Gp'"n  as  close,  where  p is  given  by  Gv~l  <GB<  Gp.  Mark  Gb  as  close  and  insert  it 
into  list  Cb.  Set  n = p. 

(C)  Locate  the  locally  smallest  close  G value  (including  list  Cb)',  denote  it  as  Gvm- 

(D)  Perform  steps  (c)-(e)  of  the  sequential  FMM  algorithm  for  Gvm- 

(E)  If  node  Gvm  belongs  to  Bm,  communicate  it  to  Qm+\  of  domain  Vm+i. 

(F)  Set  n— n+1.  Return  to  step  (B)  until  all  local  nodes  are  accepted. 

(G)  Wait  until  all  other  domains  reach  step  (G)  or  a node  belonging  to  Qm  changes  to  accepted 
status.  In  the  latter  case  go  to  step  (B). 

Table  5.  The  parallel  FMM  algorithm  Vi 


Domain  Z>m  Domain  X>m+1 
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Figure  2.  Narrow  band  approach. 


required  at  all  and  the  problem  completely  decouples.  Although  in  most  applications  this 
is  not  the  case,  the  narrow  band  approach  can  still  reduce  the  number  of  inter-domain 
communications  substantially. 


4.  Redistribution 

The  preceding  sections  dealt  exclusively  with  the  solution  of  the  reinitialization  equa- 
tion (1.3)  as  the  prerequisite  for  solving  the  redistribution  equation  (1.4).  The  basic  idea 
in  solving  Eq.  (1.4)  is  to  again  confine  its  solution  to  a small  band  around  the  globally 
smallest  Gy*  values  that  marches  outward  to  ever  larger  values  of  Gy*. 

To  this  end,  first,  initial  values  of  5 have  to  be  calculated  at  all  nodes  directly  adjacent 
to  the  G = Go  interface  by  either  first  order  approximations  (Adalsteinsson  & Sethian 
1999)  or  higher  order  schemes  (Chopp  2001).  Then,  Eq.  (1.4)  is  approximated  by 

A-  (D-.£G”  • D-‘S)  + A*  (bJJG“  ■ Djjs)  + 

A'"  (l>SP»* ' D$S)  + A+»  (DJJG*  • D+IS)  + 

A-  (oficr  ■ D-;ts)  + a*’  (p±‘kcr  ■ D*;ks ) = o 


(4.1) 
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Figure  3.  Optimal  domain  decomposition  into  2,  4,  and  8 domains. 


Here,  the  switch  A *is  given  by 


1 : max  -D+*kG",o)  = D~*kG' 

0 : otherwise  , 


(4.2) 


with  all  other  A defined  accordingly.  The  switch  A ensures  that  only  those  nodes  are 
used  to  evaluate  Sy*  in  Eq.  (4.1)  that  also  contributed  to  the  update  of  G™-k,  Eq.  (2.3). 
In  every  FMM  loop  step,  Eq.  (4.1)  is  only  solved  for  the  node  that  changes  status  to 
accepted  in  step  (c)  of  the  sequential  FMM.  If  this  node  belongs  to  Bm , its  new  value  Sn 
has  to  be  communicated  to  the  corresponding  ghost  node  in  domain  Vm+ j in  step  (E) 
of  the  parallel  FMM  algorithm 


5.  Results 


To  evaluate  the  performance  of  the  proposed  parallel  FMM  algorithm  Vi  to  solve 
Eqs.  (1.3)  and  (1.4),  the  results  of  four  different  sets  of  computations  are  presented  in  the 
following.  All  four  sets  are  three-dimensional  and  based  upon  the  same  interface  geometry, 
composed  of  a sphere  with  radius  Ro  = 0.25  and  center  located  at  xc  = (0.5, 0.5, 0.5). 
The  level  set  scalar  field  representing  this  sphere  is  given  by 

G(x)  = Rq  — \x  — xc\ . (5.1) 


The  distribution  of  S'  on  the  interface  is  set  to 


S(x)  = cos  ^ arc  tan  ^ S*n 


arctan 


X X Q 


V{y-  Vc)2  + (z-  Zc)2 , 


(5.2) 


All  computations  are  performed  on  a global  domain  of  size  [0,l]x[0,l]x[0,l]  discretized  by 
192x192x192  equidistant  cartesian  grid  nodes. 


5.1.  Optimal  domain  decomposition 

As  mentioned  in  section  1,  information  in  Eqs.  (1.3)  and  (1.4)  propagates  in  the  interface 
normal  direction.  Hence,  a domain  decomposition  such  that  all  domain  boundaries  are 
parallel  to  the  interface  normal  vectors  is  optimal  in  the  sense  that  no  information  does 
cross  these  boundaries.  Figure  3 shows  such  a decomposition  into  2,  4,  and  8 domains. 
This  domain  decomposition  is  also  optimal  with  regard  to  load  balancing.  Since  all  do- 
mains contain  the  exact  same  number  of  nodes  Gijk  < Go  as  well  as  G^k  > Go,  the  load 
imbalance  factor  is  Fi  = 0,  see  Eq.  (5.5)  defined  later. 
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Figure  4.  Speedup  due  to  parallel  FMM  employing  the  narrow  band  approach  (left)  and  whole 
domain  update  (right)  with  the  optimal  decomposition  into  1,  2,  4,  and  8 domains.  Dashed  line 
denotes  theoretically  possible  speedup,  straight  line  represents  attained  speedup. 


Figure  5.  Overhead  due  to  parallel  FMM  employing  the  narrow  band  approach  (left)  and  whole 
domain  update  (right)  with  the  optimal  decomposition  into  1,  2,  4,  and  8 domains.  Shown  are 
communication  factor  Fc  («  ) and  rollback  factor  Fr  (■  ). 


Two  different  sets  of  computations  were  performed.  The  first  set  employs  the  narrow 
band  approach  with  the  width  of  the  band  set  to  T = 8Ax.  The  second  set  performs  the 
FMM  throughout  the  whole  computational  domain. 

Figure  4 shows  a comparison  of  the  speedup  for  both  cases  to  the  theoretically  pos- 
sible speedup.  The  theoretically  possible  speedup  was  determined  by  calculating  only 
domain  number  0,  see  Fig.  3,  using  Neumann  boundary  conditions.  Note  that,  since  the 
operation  count  of  the  sequential  FMM  is  O(NlogN),  slightly  hyperlinear  speedup  is 
theoretically  possible.  However,  the  actual  efficiency  attained  in  the  parallel  computa- 
tion is  approximately  0.96  for  the  narrow  band  approach  and  roughly  0.98  for  the  whole 
domain  update. 

Figure  5 illustrates  the  overhead  due  to  the  parallelization.  Depicted  are  the  commu- 
nication factor  Fc,  defined  as 

total  number  of  communication  operations 


total  number  of  nodes 


(5.3) 
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Figure  6.  Non-optimal  domain  decomposition  into  3,  9,  and  27  domains. 


FIGURE  7.  Speedup  due  to  parallel  FMM  employing  the  narrow  band  approach  (left)  and 
whole  domain  update  (right)  with  the  non-optimal  decomposition  into  1,  3,  9,  and  27  domains. 


and  the  rollback  factor  Fr,  defined  as 

p _ total  number  of  rollback  operations 
" r total  number  of  nodes 

As  expected,  with  optimal  domain  decomposition,  the  amount  of  communication  and 
rollback  operations  is  very  small,  indicating  that  the  overhead  due  to  parallelization  is 
minimal.  This  result  is  consistent  with  the  observed  high  efficiency. 

5.2.  Non-optimal  domain  decomposition 

When  applying  the  parallel  FMM  to  actual  problems,  an  optimal  domain  decomposition, 
as  presented  in  the  previous  section,  is  not  always  viable.  In  this  non-optimal  case,  three 
major  factors  can  impact  the  performance  of  the  parallel  algorithm:  load  imbalancing, 
communication  overhead,  and  rollback  overhead.  To  illustrate  their  effect,  two  sets  of 
computations  are  performed  using  a decomposition  into  1,  3,  9,  and  27  domains  as  de- 
picted in  Fig.  6.  The  first  set  again  uses  the  narrow  band  approach  with  the  width  of  the 
band  set  to  T = 8 Ax,  whereas  the  second  set  calculates  the  FMM  throughout  the  whole 
computational  domain. 

Figure  7 shows  the  speedup  attained  for  both  the  narrow  band  approach  on  the  left  and 
the  whole  domain  update  on  the  right.  As  can  be  seen,  speedup,  and  thus  efficiency,  is 
significantly  lower  than  the  optimal  domain  decomposition  case,  see  Fig.  4.  The  efficiency 
decreases  to  0.17  for  the  narrow  band  approach  and  0.34  for  the  whole  domain  update. 


Figure  8.  Load  imbalance  factor  Ft  due  to  parallel  FMM  employing  the  narrow  band  approach 
(left)  and  whole  domain  update  (right)  with  the  non-optimal  decomposition  into  1,  3,  9,  and  27 
domains.  Shown  is  Ft  for  Gijk  < Go  (s  ) and  Ft  for  Gijk  > Go  (■  ). 


FIGURE  9.  Overhead  due  to  parallel  FMM  employing  the  narrow  band  approach  (left)  and  whole 
domain  update  (right)  with  the  non-optimal  decomposition  into  1,  3,  9,  and  27  domains.  Shown 
are  communication  factor  Fc  («  ) and  rollback  factor  Fr  (■  ). 


To  further  analyze  this  behavior,  the  load  imbalance  factor 

p __  i maximum  number  of  nodes  per  domain 
average  number  of  nodes  per  domain 


(5.5) 


for  each  of  the  four  different  domain  decompositions  is  depicted  in  Fig.  8.  Since  the 
parallel  FMM  algorithm  is  called  twice,  once  for  all  nodes  G^k  < Go  and  once  for  all 
nodes  Gxjk  >G0,Fi  is  calculated  accordingly.  Obviously,  the  load  imbalance  factors  for 
both  sets  are  relatively  large.  This  explains  the  significantly  reduced  efficiency  of  the 
non-optimal  domain  decomposition  as  compared  to  the  optimal  domain  decomposition. 
Otherwise,  the  load  imbalance  factors  for  G^*  > Go  are  roughly  the  same  for  both  sets. 
However,  the  load  imbalance  factor  for  G^k  < Go  is  significantly  larger  in  the  narrow 
band  approach  than  in  the  whole  domain  update,  leading  to  the  lower  speedup  and 
efficiency  in  the  narrow  band  approach  as  compared  to  the  whole  domain  update. 

Figure  9 exhibits  the  communication  factor  Fc  and  the  rollback  factor  Fr.  Compared  to 
the  optimal  domain  decomposition  case  (Fig.  5)  both  factors  are  one  order  of  magnitude 
larger.  Furthermore,  Fc  and  Fr  are  significantly  larger  in  the  narrow  band  approach  as 
opposed  to  the  whole  domain  update.  This  is  due  to  the  aforementioned  higher  load 
imbalance  in  the  narrow  band  approach. 
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6.  Conclusions  and  future  work 


In  this  paper,  the  first  domain  decomposition  parallelization  of  the  Fast  Marching 
Method  for  level  sets  has  been  presented.  Parallel  speedup  has  been  demonstrated  in 
both  the  optimal  and  non-optimal  domain  decomposition  case.  The  parallel  performance 
of  the  proposed  method  is  strongly  dependent  on  load  balancing  separately  the  number 
of  nodes  on  each  side  of  the  interface.  A load  imbalance  of  nodes  on  either  side  of  the 
domain  leads  to  an  increase  in  communication  and  rollback  operations.  Furthermore, 
the  amount  of  inter-domain  communication  can  be  reduced  by  aligning  the  inter-domain 
boundaries  with  the  interface  normal  vectors.  In  the  case  of  optimal  load  balancing  and 
aligned  inter-domain  boundaries,  the  proposed  parallel  FMM  algorithm  is  highly  efficient, 
reaching  efficiency  factors  of  up  to  0.98. 

Future  work  will  focus  on  the  extension  of  the  proposed  parallel  algorithm  to  higher 
order  accuracy.  Also,  to  further  enhance  parallel  performance,  the  coupling  of  the  domain 
decomposition  parallelization  to  the  Go-based  parallelization  will  be  investigated. 
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